
Text Similarity Detection in Web Scraping Data Cleaning and Fine-Tuning Models (Part 2)

Advance web scraping data cleaning with fine-tuned text similarity models. Build Sentence-BERT/Siamese encoders in PyTorch; apply contrastive learning, hard-negative mining, and domain-specific augmentation. Evaluate with cosine AUC, MAP, and clustering purity. Deploy embeddings to Elasticsearch or FAISS for vector search, near-duplicate detection, and entity consolidation. Includes Python code and a reproducible pipeline.

2025-11-11

In the previous article, we introduced text similarity detection techniques based on Levenshtein distance and TF-IDF with cosine similarity.

In this article, we move forward and explore semantic-level text similarity, which plays a critical role in web scraping data cleaning, deduplication, and downstream model fine-tuning.


Sentence-Transformers for Semantic Text Similarity

Sentence-transformers map sentences, paragraphs, or even long documents into a high-dimensional vector space.
In this vector space, semantically similar texts stay closer together, which allows systems to compute similarity efficiently using vector operations such as cosine similarity.

Sentence-transformers build on top of pre-trained language models like BERT and MiniLM. The framework optimizes these models specifically for sentence-level semantic understanding.


How Sentence-Transformers Encode Text

1. Text Encoding with Pre-trained Language Models

Sentence-transformers reuse language models that have already been pre-trained on massive corpora.
These base models understand vocabulary, grammar, and general semantics before any task-specific training begins.

The encoding process works as follows: the model tokenizes the input text, passes the tokens through the pre-trained transformer encoder, and produces a contextual vector for each token; a pooling step (described in the next section) then combines these token vectors into a single fixed-length sentence embedding.

For example, the sentence:

“I love natural language processing”

produces a vector with 384 or 768 dimensions, depending on the selected model.
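As a quick check, the snippet below encodes that sentence with the publicly available all-MiniLM-L6-v2 model (a minimal sketch; any sentence-transformers model works the same way) and prints the embedding size.

from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("I love natural language processing")
print(embedding.shape)  # (384,)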


2. Sentence-Level Semantic Pooling

Pre-trained models generate vectors for each token, not for the entire sentence.
Sentence-transformers combine token vectors into a single sentence embedding using pooling strategies.

Common pooling strategies include:

- Mean pooling: average all token vectors, typically weighted by the attention mask so padding is ignored
- Max pooling: take the element-wise maximum across token vectors
- CLS pooling: use the vector of the special [CLS] token as the sentence representation

Mean pooling remains the most widely used approach in production systems.
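To make the pooling step concrete, the sketch below computes a mean-pooled sentence embedding directly with the Hugging Face transformers library (an illustrative re-implementation of what sentence-transformers does internally, not code from this article's pipeline).

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

encoded = tokenizer(["I love natural language processing"], padding=True, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, seq_len, 384)

# Mean pooling: zero out padding positions, then average over the sequence length.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 384])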


3. Fine-Tuning for Semantic Relevance

Sentence-transformers improve semantic alignment through fine-tuning on labeled sentence pairs.

During training, the system:

- encodes each sentence of a labeled pair into an embedding
- computes the similarity between the two embeddings
- compares the predicted similarity with the annotated label
- updates the model weights to shrink the gap

Common optimization strategies include cosine similarity loss, contrastive loss, triplet loss, and multiple-negatives ranking loss.

After fine-tuning, the model handles tasks such as synonym detection, paraphrase identification, and semantic equivalence more accurately.


4. Similarity Calculation

After generating embeddings, the system calculates similarity using cosine similarity or Euclidean distance.

A cosine similarity score closer to 1 indicates stronger semantic similarity.
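For reference, cosine similarity can be computed directly from two embedding vectors; the short sketch below does this with NumPy and is equivalent to what util.cos_sim returns in the examples later in this article (the toy vectors are illustrative).

import numpy as np

def cosine_similarity(vec1, vec2):
    # Dot product divided by the product of the vector norms.
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

a = np.array([0.2, 0.7, 0.1])
b = np.array([0.25, 0.6, 0.05])
print(cosine_similarity(a, b))  # close to 1.0 when the vectors point in similar directions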


Installing Sentence-Transformers

Install the library with pip:

pip install sentence-transformers

The first execution downloads the selected model automatically from Hugging Face.


Basic Example: Semantic Similarity Detection

Test Sentences

text_a = "Artificial intelligence is transforming modern society through automation and data analysis."
text_b = "Machine learning algorithms are changing contemporary culture by automating processes and analyzing information."
text_c = "Climate change affects global weather patterns and requires immediate environmental action."

Similarity Calculation Code

from sentence_transformers import SentenceTransformer, util

MODEL_PATH = "./local-models/all-MiniLM-L6-v2"

def calculate_similarity(text1, text2, model):
    # Encode both texts and return their cosine similarity as a float.
    embedding1 = model.encode(text1, convert_to_tensor=True)
    embedding2 = model.encode(text2, convert_to_tensor=True)
    return util.cos_sim(embedding1, embedding2).item()

def main():
    model = SentenceTransformer(MODEL_PATH)

    sim_ab = calculate_similarity(text_a, text_b, model)
    sim_ac = calculate_similarity(text_a, text_c, model)

    print(f"Similarity A–B: {sim_ab:.4f}")
    print(f"Similarity A–C: {sim_ac:.4f}")

if __name__ == "__main__":
    main()

Output Interpretation

Similarity A–B: 0.6630
Similarity A–C: 0.1917

The system correctly identifies that A and B share moderate semantic similarity, while A and C are largely unrelated.
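In a scraping pipeline, these scores usually feed a threshold-based decision. The sketch below flags candidate near-duplicates among scraped snippets; the 0.8 threshold and the sample records are illustrative assumptions, and the threshold should be tuned on labeled data from your own domain.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

SIMILARITY_THRESHOLD = 0.8  # assumed value; tune per domain

def find_near_duplicates(records, model):
    # Compare every pair of records and keep those above the threshold.
    embeddings = model.encode(records, convert_to_tensor=True)
    duplicates = []
    for i, j in combinations(range(len(records)), 2):
        score = util.cos_sim(embeddings[i], embeddings[j]).item()
        if score >= SIMILARITY_THRESHOLD:
            duplicates.append((records[i], records[j], score))
    return duplicates

records = [
    "Smartphone with 128GB storage and a 6.1-inch display",
    "6.1-inch phone with 128 GB of storage",
    "Wireless noise-cancelling headphones",
]
print(find_near_duplicates(records, SentenceTransformer("all-MiniLM-L6-v2")))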


Why Fine-Tune Sentence-Transformers?

General-purpose models learn broad semantic rules, but domain-specific tasks require specialization.

Fine-tuning improves performance in scenarios such as:

- near-duplicate detection across scraped pages that paraphrase the same content
- entity consolidation, where records describing the same product, company, or person must be merged
- domains with specialized vocabulary, such as e-commerce listings or legal and medical text, that general models handle poorly

Empirical results often show 10%–30% performance gains after fine-tuning, though the exact improvement depends on the domain and the quality of the labeled pairs.


Fine-Tuning Workflow Overview

1. Data Preparation

Prepare labeled sentence pairs with similarity scores between 0 and 1.
High-quality annotations matter more than raw volume.
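Beyond manually annotated pairs, hard-negative mining (mentioned in the summary above) is a common way to strengthen the training set: for each sentence, the closest non-matching candidates are added as negatives so the model learns to separate near-misses. The helper below is an illustrative sketch; mine_hard_negatives and its parameters are not part of any library API.

from sentence_transformers import SentenceTransformer, util

def mine_hard_negatives(sentences, positive_pairs, model, top_k=2):
    # Score every sentence against every other sentence with the base model.
    embeddings = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(embeddings, embeddings).cpu().tolist()
    positives = {frozenset(pair) for pair in positive_pairs}
    hard_negatives = []
    for i, sentence in enumerate(sentences):
        # Rank the other sentences by similarity and keep the closest non-positives.
        ranked = sorted(range(len(sentences)), key=lambda j: scores[i][j], reverse=True)
        picked = 0
        for j in ranked:
            if j == i or frozenset((sentence, sentences[j])) in positives:
                continue
            hard_negatives.append({"sentence1": sentence, "sentence2": sentences[j], "score": 0.0})
            picked += 1
            if picked == top_k:
                break
    return hard_negatives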


2. Base Model Selection

Recommended options:

- all-MiniLM-L6-v2: fast, 384-dimensional embeddings, a strong default for English text
- all-mpnet-base-v2: slower, 768-dimensional embeddings, higher accuracy
- paraphrase-multilingual-MiniLM-L12-v2: suited to multilingual or mixed-language scraped data


3. Loss Function Selection

Choose the loss to match the shape of your labels:

- CosineSimilarityLoss: pairs annotated with a continuous score between 0 and 1
- ContrastiveLoss: pairs labeled as similar or dissimilar
- TripletLoss: anchor / positive / negative triplets
- MultipleNegativesRankingLoss: only positive pairs are available; other sentences in the batch act as negatives
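When only confirmed positive pairs are available, which is common with scraped data where duplicates are easy to collect but graded scores are not, MultipleNegativesRankingLoss is a practical choice. The sketch below shows that setup; the two example pairs are illustrative.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Positive pairs only; every other sentence in the batch acts as a negative.
train_examples = [
    InputExample(texts=["Smartphone with 128GB storage", "128 GB smartphone"]),
    InputExample(texts=["Wireless noise-cancelling headphones", "Bluetooth headphones with ANC"]),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)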

4. Training Configuration

Typical starting points:

- batch size: 16–32, reduced if GPU memory is limited
- epochs: 1–4 for larger datasets, more for very small ones
- warmup steps: roughly 10% of the total training steps
- learning rate: the library default of 2e-5 usually works well

Fine-Tuning Example Code

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import pandas as pd

# Labeled sentence pairs with similarity scores between 0 and 1.
data = [
    {"sentence1": "Artificial intelligence is transforming healthcare", "sentence2": "AI is revolutionizing medical services", "score": 0.91},
    {"sentence1": "Natural language processing enables chatbots", "sentence2": "NLP powers conversational AI systems", "score": 0.93},
]

df = pd.DataFrame(data)

# Wrap each pair in an InputExample so the loss function can read the texts and label.
train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]))
    for _, row in df.iterrows()
]

# The DataLoader reshuffles the examples at every epoch.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("all-MiniLM-L6-v2")
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=15,
    warmup_steps=10,
    output_path="./fine-tuned-all-MiniLM-L6-v2"
)
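After training, the model saved at output_path loads exactly like a pre-trained one; a minimal usage sketch, assuming the fine-tuning script above has already run:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("./fine-tuned-all-MiniLM-L6-v2")

emb1 = model.encode("Artificial intelligence is transforming healthcare", convert_to_tensor=True)
emb2 = model.encode("AI is revolutionizing medical services", convert_to_tensor=True)
print(util.cos_sim(emb1, emb2).item())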

Final Notes